Signature and K-State Honor Code
E. McLaren. "On my honor, as a student, I have neither given nor received unauthorized aid on this academic work."
1 Business Understanding
2 Data Understanding
3 Data Preparation
4 Data Visualization
5 Correlation Analysis
6 Findings Summary
References
Understanding and predicting movie success remains a key challenge in the movie industry. Film budgets are incredibly high but can yield great profits if they are managed well. According to researchers in the field, predicting movie success has never been easy. It was previously believed that critic reviews or even release schedules played the greatest role, but these were assumptions that needed further evidence and testing. With the assistance of analytics-based approaches, however, seemingly overwhelming data sets can be turned into insights. As movie researchers and data analysts continue to dig deeper into these data sets, we are able to make more informed decisions and recommendations for professionals in the industry.
Using a data set of 5043 movies listed on IMDB, this project will analyze trends among top-performing movies based on multiple variables such as score, duration, and genre. It will also dig deeper into what makes these movies successful in order to support future predictions.
For the purposes of this project, a movie's success will be measured by its IMDB score.
In this step, I import the packages needed to prepare the assignment.
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline
#scatter matrix
from pandas.plotting import scatter_matrix
#scipy package
from scipy.stats import mannwhitneyu
#regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
from sklearn import linear_model
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import RFE
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
#confusion matrix
import scikitplot as skplt
#classification
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
#clustering
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import ward_tree
from scipy.cluster.hierarchy import dendrogram, linkage, ward
First I will load the data set into the report.
# load movie_metadata.csv
df = pd.read_csv('data/movie_metadata.csv')
df.head()
Then I will check the general info of the data set, such as statistics of each column and data types, as well as the correlations between them to get a better understanding of the data.
# check statistics of data set
df.describe()
#check data types of data set
df.info()
From the data types list above, I am able to identify a set of object columns that are not useful for further analysis. Though actors could be an interesting way to segment the movies and see which actors were popular among the top titles, the cast doesn't reliably determine a movie's success. These columns include:
From the correlation below, I am able to identify a list of columns that will likely not be useful in further analysis in the context of movie success due to their low correlation with IMDB score. These columns include:
#check correlation values for all variables with IMDB score to determine relevance for analysis
df.corr()
#find amount of missing values
df.isnull().sum().sort_values(ascending=False)
Another data quality issue with this data set is missing values. In the code above, I checked for missing values and found a significant number missing from the gross and budget columns, which are highly relevant to predicting movie success. Since these values are so important to determining success, I don't want to fill them with averages or assumptions, as that could bias the regression I plan to run later.
Next Steps:
As identified in the previous section, in order to clean the data and prepare it for analysis, I will need to check for duplicate records to ensure these don't skew the results when analyzing.
According to the code below, there are 45 duplicated records. I then view the duplicated records side by side to verify the issue.
# count duplicate records, code sourced from: https://www.ritchieng.com/pandas-removing-duplicate-rows/
df.duplicated().sum()
# examine duplicated rows, code sourced from: https://www.ritchieng.com/pandas-removing-duplicate-rows/
df.loc[df.duplicated(keep=False), :].sort_values('movie_title').head(10)
Since I can see duplicates exist in the data set I will remove the second instance of the duplicate records so that I keep only one record for each movie. After that, I verify that the 45 extra records have been removed.
# drops the duplicate rows, code sourced from: https://www.ritchieng.com/pandas-removing-duplicate-rows/
# update dataset to not include the duplicate records, "inplace=True" code sourced from https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas
df.drop_duplicates(keep='first', inplace=True)
#check for missing values
df.shape
In the previous section, I identified that the gross and budget columns were missing the most values. Since the data in these columns is so critical to determining movie success, I won't replace the missing entries with average values; instead I will drop those rows to preserve the validity of the data.
#find amount of missing values
df.isnull().sum().sort_values(ascending=False)
#handling missing values: remove the rows with any missing value in the budget or gross column
df = df.dropna(subset=['gross', 'budget'])
df.isnull().sum().sort_values(ascending=False)
df.shape
Though this noticeably reduces the number of data entries I have, I still retain a large portion of the data volume and can be certain that the results won't be skewed by imputed assumptions.
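The retained fraction can be checked with a quick calculation (a sketch using the row counts reported by df.shape: 5043 originally, and 3857 after removing duplicates and rows missing gross or budget):

```python
# row counts taken from the df.shape outputs in this report
rows_original, rows_retained = 5043, 3857

# roughly three quarters of the original data survives cleaning
print(f"Retained {rows_retained / rows_original:.1%} of the original rows")
```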
In this section, I will remove the columns I identified in the previous section that are not relevant to determining the success of a movie.
# remove irrelevant columns
df_new = df.drop(['actor_3_facebook_likes','actor_1_facebook_likes', 'facenumber_in_poster', 'actor_2_facebook_likes', 'aspect_ratio', 'plot_keywords', 'movie_imdb_link','actor_2_name', 'actor_1_name', 'actor_3_name', 'genres'], axis=1)
df_new.head()
To make the data easier to test with a correlation analysis, I will convert as many object columns as possible into integers so they can be included.
#Replace current values of Color and Black and White with 0 and 1
df_new = df_new.replace({'color': ' Black and White'}, {'color': 0})
df_new = df_new.replace({'color': 'Color'}, {'color': 1})
#verify color names have been changed
df_new['color'].value_counts()
#get language values to inform categorization
df_new.groupby('language')['movie_title'].count().sort_values(ascending=False)
#Replace current values of other languages and English with 0 and 1 respectively
df_new = df_new.replace({'language': 'English'}, {'language': 1})
df_new = df_new.replace({'language': ['French','Spanish','Mandarin','German','Japanese','Hindi','Cantonese','Italian','Korean','Portuguese','Norwegian','Hebrew','Persian','Dutch','Danish','Thai','Dari','Indonesian','Aboriginal','Icelandic','Hungarian','Arabic','Aramaic','Bosnian','Telugu','Czech','Swedish','Russian','Romanian','Dzongkha','None','Filipino','Mongolian','Maya','Kazakh','Vietnamese','Zulu']}, {'language': 0})
df_new.head()
#verify language names have been changed and grouped accurately
df_new.groupby('language')['movie_title'].count().sort_values(ascending=False)
#get country values to inform categorization
df_new.groupby('country')['movie_title'].count().sort_values(ascending=False)
#Replace current values of other countries and USA with 0 and 1 respectively
df_new = df_new.replace({'country': 'USA'}, {'country': 1})
df_new = df_new.replace({'country': ['UK','France','Germany','Canada','Australia','Spain','Japan','China','India','Hong Kong','New Zealand','Italy','Mexico','South Korea','Denmark','Ireland','Brazil','Norway','Iran','Thailand','Argentina','South Africa','Netherlands','Russia','Israel','Czech Republic','Romania','Iceland','Hungary','Taiwan','Belgium','Chile','Aruba','Colombia','West Germany','Finland','Georgia','Greece','Indonesia','New Line','Official site','Peru','Philippines','Poland','Sweden','Afghanistan']}, {'country': 0})
df_new.head()
#verify country names have been changed and grouped accurately
df_new.groupby('country')['movie_title'].count().sort_values(ascending=False)
#get content rating values to inform categorization and renaming
df_new.groupby('content_rating')['movie_title'].count().sort_values(ascending=False)
#Replace current values of Not Rated, Passed, Approved, Unrated with R, based on historical changes in the rating system: https://www.filmratings.com/History
df_new = df_new.replace({'content_rating': ['Not Rated','Passed', 'Approved', 'Unrated']}, {'content_rating': 'R'})
df_new.head()
#Replace current values of X, M, and GP with NC-17, PG, and PG respectively based on changes in historical rating system: https://www.filmratings.com/History
df_new = df_new.replace({'content_rating': ['M','GP']}, {'content_rating': 'PG'})
df_new = df_new.replace({'content_rating': 'X'}, {'content_rating': 'NC-17'})
df_new.head()
#verify content rating names have been changed and grouped accurately
df_new.groupby('content_rating')['movie_title'].count().sort_values(ascending=False)
#Replace current values of G, PG, PG-13, R, and NC-17, with 1 to 5 respectively for analysis later
df_new = df_new.replace({'content_rating': ['G']}, {'content_rating': 1})
df_new = df_new.replace({'content_rating': ['PG']}, {'content_rating': 2})
df_new = df_new.replace({'content_rating': ['PG-13']}, {'content_rating': 3})
df_new = df_new.replace({'content_rating': ['R']}, {'content_rating': 4})
df_new = df_new.replace({'content_rating': ['NC-17']}, {'content_rating': 5})
df_new.head()
In this step, it is critical that the remaining missing values are either removed or replaced. Here, I will use imputation to replace null values with each column's most representative value (the mean for numeric columns, the most common value for categorical ones). The columns still missing values are:
#find remaining missing values
df_new.isnull().sum().sort_values(ascending=False)
To resolve the missing values for content_rating, I will change all null values to the most common rating, "R," which is denoted by 4 in the data set.
# fill content_rating nulls with the most common rating, "R" (encoded as 4)
#verify no more null values for content_rating
df_new = df_new.fillna({'content_rating': 4})
df_new.isnull().sum()
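Rather than hard-coding the 4, the most common value could be looked up with `mode()` so the fill stays correct even if the distribution changes (a sketch on a toy encoded column, not the cell above):

```python
import pandas as pd

# toy encoded content ratings with one missing value
ratings = pd.Series([4, 4, 3, 2, None, 4])

# fill nulls with the most frequent value instead of a hard-coded constant
filled = ratings.fillna(ratings.mode()[0])
print(filled.tolist())  # [4.0, 4.0, 3.0, 2.0, 4.0, 4.0]
```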
To resolve the missing values for color, I will change all null values to their most likely rating by their year. If before 1939, the color will be denoted by 0 for Black and White. If after 1939, the color will be denoted by 1 for Color.
# changing color null values with 1 if after 1939, 0 if before, code based on Canvas discussion board suggestion: https://k-state.instructure.com/courses/81006/discussion_topics/505217?module_item_id=1875123
#verify no more null values for color
df_new.loc[df_new['color'].isnull() & (df_new['title_year'] < 1939),'color'] = 0
# use >= so movies from exactly 1939 are also filled
df_new.loc[df_new['color'].isnull() & (df_new['title_year'] >= 1939),'color'] = 1
df_new.isnull().sum()
To resolve the missing values for language, I will change all null values to their most common result, English, denoted by a 1 in the data set.
# fill language nulls with the most common value, 1 (English)
#verify no more null values for language
df_new = df_new.fillna({'language': 1})
df_new.isnull().sum()
To resolve the missing values for duration, I will change all null values to the average movie duration of the data set.
# changing duration null value with average duration
#verify no more null values for duration
df_new = df_new.fillna({'duration': df_new['duration'].mean()})
df_new.isnull().sum()
To resolve the missing values for num_critic_for_reviews, I will change all null values to the average number of critic reviews of the data set.
# changing num_critic_for_reviews null value with average
#verify no more null values for num_critic_for_reviews
df_new = df_new.fillna({'num_critic_for_reviews': df_new['num_critic_for_reviews'].mean()})
df_new.isnull().sum()
Now that I have finished changing the values, I verified that there are no more missing values in the step above, and will also check the length of the cleaned data set to verify that it is the expected length. Since 3857 is the correct number of records, I am done resolving missing values.
#verify length of cleaned data set
len(df_new)
In a later step, I will want to see the genres' correlations with IMDB scores. I'll use dummy variables to separate the genres out and save them as df1 to be used later.
#now, separate a string of genres into dummy variables
# I borrowed the code below from Midterm discussion board: https://k-state.instructure.com/courses/81006/discussion_topics/505217?module_item_id=1875123
df1 = df.join(df.pop('genres').str.get_dummies('|'))
df1 = df1[['Action','Adventure','Animation','Biography','Comedy','Crime','Documentary','Drama','Family','Fantasy','Film-Noir','History','Horror','Music','Musical','Mystery','Romance','Sci-Fi','Short','Sport','Thriller','War','Western', 'imdb_score']]
df1.head()
To ensure I completed this step properly, I check the length of the df1 data set to verify that it is the expected length. Since 3857 is the correct number of records, I have correctly separated the genres into dummy variables without adding extra rows.
len(df1)
In order to gain more insights when analyzing the data set, I need to create two extra columns that are calculations based on existing columns. The first calculated column is profit, which I calculate as the value of the gross column minus the value of the budget column.
# create new columns
df_new['profit'] = (df_new['gross'] - df_new['budget'])
df_new.head()
The second calculated column is return on investment (ROI), which I calculate as the proportion of profit values to budget values.
df_new['roi'] = df_new['profit'] / df_new['budget']
df_new.head()
Now that the data has been prepared, I will create visualizations to better interpret the trends, patterns, and relationships within the data set, as well as to extract insights. The questions I will answer:
For this business problem, I want to see a general representation of how the amount of movies released has changed over time. To demonstrate this, I will make a histogram displaying the total movies released by year.
Based on this visualization, I can conclude that movie releases grew roughly exponentially up to the year 2000. After the early 2000s, however, the number begins to decline. This could be caused by a variety of factors; one likely contributor is the recession that affected the American economy from roughly 2008 to 2012.
#movies released per year
bins = 30
plt.hist(df_new['title_year'], bins, color='blue', label= 'Movies released by year')
plt.xlabel('Year')
plt.title('Movies released');
For this business problem, I want to get an idea of the types of movies that performed the highest in the data set. To demonstrate this, I will create a scatter plot displaying the movie titles by their gross sales and IMDB score. Since we assume in this project that IMDB score is an adequate portrayal of a movie's success, I will sort by these values first and then by top gross sales.
Based on this visualization, I can conclude that the top 10 movies from the data set are:
Clearly, some of these movies are part of a series, or are a sequel paired with its original, which shows that some storylines are more popular than others.
### top performing movies by imdb score, code based on https://plot.ly/python/basic-charts/
top10mv = df_new[['movie_title', 'director_name','imdb_score', 'gross']]
top10mv = top10mv.sort_values('imdb_score',ascending=False).head(10)
px.scatter(top10mv, x="imdb_score", y="gross", text='movie_title', hover_name='movie_title', hover_data=['director_name','gross', 'imdb_score'], color="director_name", size='gross')
For this business problem, I want to get an idea of the movie directors that performed the highest in the data set. To demonstrate this, I will create a scatter plot displaying the directors by their gross sales and IMDB score. Since we assume in this project that IMDB score is an adequate portrayal of a movie's success, I will sort by these values first and then by top gross sales.
Based on this visualization, I can conclude that the top 20 directors from the data set are:
These directors are prominent in the industry, and anyone familiar with movies would recognize some of these notable names. A couple of interesting details:
### top performing directors by imdb score, code based on https://plot.ly/python/basic-charts/
top20dc = df_new[['movie_title', 'director_name','imdb_score', 'gross']]
top20dc = top20dc.sort_values('imdb_score',ascending=False).head(20)
px.scatter(top20dc, x="imdb_score", y="gross", text='director_name', hover_name='movie_title', hover_data=['director_name','gross', 'imdb_score'], color="director_name", size='gross')
For this business problem, I want to get an idea of the worst performing movies in the data set specifically in relation to the greatest losses. To demonstrate this, I will create a scatter plot displaying the movie titles by their gross sales and IMDB score. For this question, we will sort the subset of data by lowest profits (losses) and then by IMDB score.
Based on this visualization, I can conclude that some of the worst performing movies from the data set are:
These movies aren't always well-known, which is a realistic expectation for low performers. However, some fairly well-known movies also have some of the lowest profits, which stood out to me:
### worst performing movies by profit, code based on https://plot.ly/python/basic-charts/
last20mv = df_new[['movie_title', 'director_name','imdb_score', 'profit']]
last20mv = last20mv.sort_values('profit',ascending=False).tail(20)
px.scatter(last20mv, x="imdb_score", y="profit", text='movie_title', hover_name='movie_title', hover_data=['director_name','profit', 'imdb_score'], size='imdb_score')
For this business problem, I want to get an idea of how IMDB score varies among the different content rating levels. To demonstrate this, I will create a scatter plot displaying the data points by their gross sales and IMDB score for each rating level. For this question, I will sort the subset of data by IMDB score first.
Based on this visualization, I can draw some conclusions about the data set:
### top movies at each rating level by imdb score and gross, code based on https://plot.ly/python/basic-charts/
top_cr = df_new[['movie_title','imdb_score','gross','content_rating']]
top_cr = top_cr.sort_values('imdb_score',ascending=False)
px.scatter(top_cr, x="imdb_score", y="gross", hover_name='movie_title', hover_data=['movie_title','gross','imdb_score'], color='content_rating')
For this business problem, I want to get an idea of how IMDB score varies over time. To demonstrate this, I will create a scatter plot displaying the data points by year and IMDB score. For this question, I will sort the subset of data by release year first.
Based on this visualization, I can draw some conclusions about the data set:
#imdb scores through time, code based on https://plot.ly/python/basic-charts/
yr_prof = df_new[['movie_title','imdb_score','gross','title_year']]
yr_prof = yr_prof.sort_values('title_year')
px.scatter(yr_prof, x="title_year", y="imdb_score", hover_name='movie_title', hover_data=['movie_title','gross','imdb_score'], color='imdb_score')
In this visualization, I am comparing the density plots of IMDB scores at the different movie rating levels. Based on my findings:
#density plots of content ratings by IMDB scores
sns.kdeplot(df_new[df_new['content_rating'] == 1]['imdb_score'])
sns.kdeplot(df_new[df_new['content_rating'] == 2]['imdb_score'])
sns.kdeplot(df_new[df_new['content_rating'] == 3]['imdb_score'])
sns.kdeplot(df_new[df_new['content_rating'] == 4]['imdb_score'])
sns.kdeplot(df_new[df_new['content_rating'] == 5]['imdb_score'])
#count of each content rating level
rating = df_new.groupby('content_rating').size().reset_index()
rating = rating.replace({1: 'G', 2: 'PG', 3: 'PG-13', 4: 'R', 5: 'NC-17'}).rename(columns={0:'Count'})
rating
rating.plot(kind='barh').set_yticklabels(rating.content_rating);
I will be doing a correlation analysis to get a better understanding of the factors that are correlated with a successful movie. A successful movie will be determined as one having a high IMDB score. I will also look at the genres of movies that have the highest correlation with high IMDB scores as well. In the steps below, I will gather information about the data in the set to determine the correlations between the variables.
# data set basic statistics
df_new.describe()
#correlation analysis
df_new.corr()
# show correlation plot
plt.figure(figsize=(8,8))
sns.heatmap(df_new.corr(), vmax=.8, square=True, annot=True, fmt=".2f");
Based on this analysis, it is clear that there are some strong correlations between well-performing movies based on IMDB score and variables from the data set.
Positive relationship with IMDB score:
This means that movies with higher performance scores likely also have a longer duration and higher gross sales. Also, movies that perform well also have a higher number of votes from users, more critic reviews, and more likes on the movie's Facebook page.
# show correlation plot
plt.figure(figsize=(12,12))
sns.heatmap(df1.corr(), vmax=.8, square=True, annot=True, fmt=".2f");
I also created a separate data set to analyze the correlation between movie genres and IMDB scores. From this analysis, it is clear that some genres are more closely correlated with IMDB scores than others.
Positive relationship with IMDB score:
Negative relationship with IMDB score:
While these correlations aren't especially strong, they do show that movies in the Drama and Biography genres tend to have higher IMDB scores, while movies in the Comedy and Horror genres tend to score slightly lower.
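The genre correlations discussed above can also be ranked directly from the correlation matrix rather than read off the heatmap. A minimal sketch on a toy stand-in for df1 (genre dummy columns plus imdb_score):

```python
import pandas as pd

# toy stand-in for df1: genre dummy variables plus the IMDB score
df1 = pd.DataFrame({
    'Drama':      [1, 1, 0, 1, 0],
    'Comedy':     [0, 0, 1, 0, 1],
    'imdb_score': [8.1, 7.9, 5.5, 7.2, 6.0],
})

# rank genres by their correlation with IMDB score, strongest first
corrs = df1.corr()['imdb_score'].drop('imdb_score').sort_values(ascending=False)
print(corrs)
```

On the real df1, the same one-liner would list all 23 genres in order of correlation.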
Based upon the findings of this report, I have been able to dig deeper into the performances of movies through time as listed on IMDB. The insights gained from this data analysis range from unexpected to highly suspected.
Time-based insights:
IMDB score-based insights:
Movie rating insights:
Correlation insights:
Data preparation insights:
Steps:
- Regression with feature selection
- Stats model significance
- Lasso model
- Feature selection with KBest
- Random forest regressor
I will start by removing two categorical columns that will not be helpful in the following regressions, classifications, and clustering steps.
#remove unnecessary columns
df_reg = df_new.drop(['movie_title','director_name'],axis = 1)
#assigning columns to X and Y variables
y = df_reg['imdb_score']
X = df_reg[['color','num_critic_for_reviews','duration','director_facebook_likes','num_voted_users','cast_total_facebook_likes','num_user_for_reviews','language','country','content_rating','title_year','movie_facebook_likes','roi']]
To better inform the variables I select for my models and to better understand their importances, I create a linear regression model and use Recursive Feature Elimination (RFE).
#recursive feature selection for regression
lr = lm.LinearRegression()
rfe = RFE(lr, n_features_to_select=3)
rfe_y = rfe.fit(X,y)
print("Features sorted by their rank:")
print(sorted(zip([x for x in rfe.ranking_], X.columns)))
I create four models: one regression with all variables to act as a benchmark, and three others that use the top variables from the RFE results.
#First Model
runs_reg_model1 = ols("imdb_score~num_voted_users+duration+director_facebook_likes+title_year",df_reg)
runs_reg1 = runs_reg_model1.fit()
#Second Model
runs_reg_model2 = ols("imdb_score~num_voted_users+duration+title_year",df_reg)
runs_reg2 = runs_reg_model2.fit()
#Third Model
runs_reg_model3 = ols("imdb_score~num_voted_users+duration+director_facebook_likes",df_reg)
runs_reg3 = runs_reg_model3.fit()
#Full model
runs_reg_model = ols("imdb_score~color+num_critic_for_reviews+duration+director_facebook_likes+num_voted_users+cast_total_facebook_likes+num_user_for_reviews+language+country+content_rating+title_year+movie_facebook_likes+roi",df_reg)
runs_reg = runs_reg_model.fit()
#view model results
print(runs_reg.summary())
print(runs_reg1.summary())
print(runs_reg2.summary())
print(runs_reg3.summary())
From these results, it is evident that there is some multicollinearity, as expected from the correlation plot shown earlier. Using that plot, I tried to remove from the model at least one variable that was potentially collinear with another. This did reduce the accuracy of the model, as expected when removing a variable.
One way to resolve multicollinearity is to use VIF (variance inflation factor) to identify variables that are strongly collinear and remove one of them from the model. Though this technique hasn't been taught to us yet, I attempted it multiple times but was unsuccessful in coding it for this situation. To approximate the strategy with the skills I do have, I identified a few highly correlated variables and tried to reduce their number in each model where possible to see if I could get a better result.
#create regression model with RFE identified important variables and remove those that might be collinear
y = df_reg['imdb_score']
X = df_reg[['duration','num_voted_users','num_user_for_reviews','budget','title_year','gross']]
#fit regression model
sk_model1 = lm.LinearRegression()
sk_model1.fit(X, y)
model1_y = sk_model1.predict(X)
# The coefficients
print('Coefficients: ', sk_model1.coef_)
# y-intercept
print("y-intercept ", sk_model1.intercept_)
coef = ["%.3f" % i for i in sk_model1.coef_]
xcolumns = [ i for i in X.columns ]
list(zip(xcolumns, coef))
print("mean square error: ", mean_squared_error(y, model1_y))
print("variance or r-squared: ", explained_variance_score(y, model1_y))
# stats model output for previous regression model
runs_reg_model3 = ols("imdb_score~duration+num_voted_users+num_user_for_reviews+budget+title_year+gross",df_reg)
runs_reg3 = runs_reg_model3.fit()
print(runs_reg3.summary())
While there are still many things to be improved, I did manage to get a better variance score with a smaller number of variables than in previous attempts.
#validation: more accurate way to validate model for deployment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model1 = lm.LinearRegression()
model1.fit(X_train, y_train)
pred_y = model1.predict(X_test)
coef = ["%.3f" % i for i in model1.coef_]
xcolumns = [ i for i in X.columns ]
list(zip(xcolumns, coef))
print("mean square error: ", mean_squared_error(y_test, pred_y))
print("variance or r-squared: ", explained_variance_score(y_test, pred_y))
Deploying this model would likely not be very successful, as I expected, due to the problems with multicollinearity.
# high values indicate multicollinearity
print(np.linalg.cond(runs_reg1.model.exog)) #multicollinearity
print(np.linalg.cond(runs_reg2.model.exog))
print(np.linalg.cond(runs_reg3.model.exog))
As shown above, the models' high condition numbers indicate that multicollinearity is present. I next attempt the lasso method, which penalizes having too many predictors.
# Fit the model
mod1 = lm.Lasso(alpha=0.1) #higher alpha (penalty parameter), fewer predictors
mod1.fit(X, y)
mod1_y = mod1.predict(X)
#get model info
mod1
#get coeff and y-int
print('Coefficients: ', mod1.coef_)
print("y-intercept ", mod1.intercept_)
# zip values in columns
coef = ["%.3f" % i for i in mod1.coef_]
xcolumns = [ i for i in X.columns ]
list(zip(xcolumns, coef))
# show MSE and variance
print("mean square error: ", mean_squared_error(y, mod1_y))
print("variance or r-squared: ", explained_variance_score(y, mod1_y))
As with the other models, multicollinearity remains an issue, and tweaking the variable sets isn't likely to change that.
Then, using K-Best feature selection with regression models below, I make another attempt.
#select only 4 X variables
X_new = SelectKBest(f_regression, k=4).fit_transform(X, y)
X_new
# this helps us find out which variables are selected
selector = SelectKBest(f_regression, k=4).fit(X, y)
idxs_selected = selector.get_support(indices=True)
print(idxs_selected)
# indices 0, 1, 2 and 5 refer to duration, num_voted_users, num_user_for_reviews, and gross respectively
X.head(2)
#new model regression
mod2 = lm.LinearRegression()
mod2.fit(X_new, y)
mod2_y = mod2.predict(X_new)
print("mean square error: ", mean_squared_error(y, mod2_y))
print("variance or r-squared: ", explained_variance_score(y, mod2_y))
This model again receives similar results to the other regression models.
# use f_regression with k = 3 and develop a new regression model
#f_Regression (Feature Selection)
X_new = SelectKBest(f_regression, k=3).fit_transform(X, y)
X_new
#regression model 2
mod2 = lm.LinearRegression()
mod2.fit(X_new, y)
mod2_y = mod2.predict(X_new)
print("mean square error: ", mean_squared_error(y, mod2_y))
print("variance or r-squared: ", explained_variance_score(y, mod2_y))
After this attempt, I decide it is time to use the Random Forest regressor. While more of a black-box method, this model often produces stronger results, though it can be difficult to explain where it derives its feature importances.
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
#assigning columns to X and Y variables
y = df_reg['imdb_score']
X = df_reg[['color','num_critic_for_reviews','duration','director_facebook_likes','num_voted_users','cast_total_facebook_likes','num_user_for_reviews','language','country','content_rating','budget','title_year','movie_facebook_likes','gross','roi']]
#regr = RandomForestRegressor(n_estimators=100, random_state=0)
regr = RandomForestRegressor()
regr.fit(X, y)
regr_predicted = regr.predict(X)
print("mean square error: ", mean_squared_error(y, regr_predicted))
print("variance or r-squared: ", explained_variance_score(y, regr_predicted))
As you can see, this has a much stronger result given its high r-squared. The following is the list of feature importances in this model.
#RFR feature importances
sorted(zip(regr.feature_importances_, X.columns))
Using these results, I created a condensed model with fewer variables based on the feature importances listed above.
#new random forest regressor smaller model
y = df_reg['imdb_score']
X = df_reg[['duration','num_voted_users','num_user_for_reviews','budget','gross']]
regr = RandomForestRegressor(n_estimators=100, random_state=0)
regr.fit(X, y)
regr_predicted = regr.predict(X)
print("mean square error: ", mean_squared_error(y, regr_predicted))
print("variance or r-squared: ", explained_variance_score(y, regr_predicted))
Impressively enough, this actually improves on the r-squared from the previous model. I display the feature importances for this model below.
sorted(zip(regr.feature_importances_, X.columns))
feature_importances = pd.DataFrame(regr.feature_importances_, index = X.columns,
columns=['importance']).sort_values('importance', ascending=False)
feature_importances
Though this model performs significantly better than the others, that raises the question of what it did so differently. Because it is a black-box method, I am not sure it will be the most effective model to deploy for long-run predictions on this data set, especially since multicollinearity influences regression models (though it matters less for classification and clustering). I will try those methods next to see whether they provide better insight.
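One way to make the multicollinearity concern concrete is to compute variance inflation factors (VIFs) for the predictors. The sketch below is illustrative only: it uses synthetic columns rather than the movie data, and the `vif` helper is one I am defining here, not part of the notebook. In practice, `df_reg`'s numeric columns could be passed to it instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2) when the column
    is regressed on all the others. Values above roughly 5-10 are a common
    rule of thumb for problematic collinearity."""
    out = {}
    for col in X.columns:
        others = X.drop(columns=[col])
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        out[col] = 1.0 / (1.0 - r2)
    return out

# Synthetic illustration: x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = pd.DataFrame({'x1': x1,
                  'x2': x1 + rng.normal(scale=0.1, size=500),
                  'x3': rng.normal(size=500)})
print(vif(X))
```

Here the near-duplicate columns `x1` and `x2` should show very large VIFs while the independent `x3` stays near 1, which is the pattern to look for among the movie predictors.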
In this section, I attempt to find a model with stronger accuracy for later deployment and prediction on this data set. To start, I create a new column in the data set that bins the IMDB scores into 4 categories by rating: less than 4, 4 to 6, 6 to 8, and 8 to 10; respectively Bad, Okay, Good, and Excellent.
#create bins for IMDB scores
df_reg['imdb_score_bin'] = pd.cut(df_reg['imdb_score'], bins=[0, 4, 6, 8, 10], labels=['0','1','2','3'])
# declare X variables and y variable, and remove both y variables from independent data set
y = df_reg['imdb_score_bin']
X = df_reg.drop(['imdb_score_bin','imdb_score'], axis =1)
print(y.shape, X.shape)
In the next steps, I use the decision tree classifier as a model, train it, and deploy it to see how well it can predict for the data set.
# split validation
#validation: more accurate way to validate model for deployment
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize DecisionTreeClassifier() ... name the decision model "dt"
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt
# Train a decision tree model
dt= dt.fit(X_train, y_train)
#size of train and test data sets
print(len(X_train), len(X_test))
#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html
from sklearn import metrics
print(metrics.accuracy_score(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, dt.predict(X_test)))
# y_test is the actual y value in the testing dataset
# dt.predict(X_test) is the predicted y value generated by the model
# The closer they match, the more accurate the model.
import scikitplot as skplt
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test), y_pred=dt.predict(X_test))
plt.show()
This decision tree does significantly better than the previous regression models, with an accuracy of 67%. Now I will try using the K Nearest Neighbors model to classify the data.
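One caveat: a single 70/30 split can over- or under-state accuracy depending on how the split happens to fall. Cross-validation averages over several splits and gives a more stable estimate. The sketch below uses synthetic data from `make_classification` as a stand-in; the notebook's equivalent would pass the real `X` and `y` from `df_reg` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the movie features, with 4 classes to mirror the bins.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     n_informative=5, n_classes=4,
                                     random_state=42)

# 5-fold cross-validated accuracy: the mean and spread across folds are more
# trustworthy than a single train/test split.
scores = cross_val_score(DecisionTreeClassifier(random_state=42),
                         X_demo, y_demo, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```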
# evaluate the model by splitting into train and test sets & develop knn model
# split validation
#validation: more accurate way to validate model for deployment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize KNeighborsClassifier()
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
# Train the KNN model
knn= knn.fit(X_train, y_train)
knn
#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html
print(metrics.accuracy_score(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, knn.predict(X_test)))
The result of this model is slightly lower, so I will use GridSearchCV to tune it.
#create a dictionary of all values we want to test for n_neighbors
params_knn = {'n_neighbors': np.arange(1, 25)}
#use gridsearch to test all values for n_neighbors
from sklearn.model_selection import GridSearchCV
knn_gs = GridSearchCV(knn, params_knn, cv=5)  #note: the iid argument was removed in newer scikit-learn versions
#fit model to training data
knn_gs.fit(X_train, y_train)
#save best model
knn_best = knn_gs.best_estimator_
#check best n_neighbors value
print(knn_gs.best_score_)
print(knn_gs.best_params_)
print(knn_gs.best_estimator_)
Implementing the best model from KNN improved the accuracy to be nearly equivalent with the first decision tree model. Since that initial decision tree model is a full model, I'd like to make a simpler tree.
# A simpler decision tree(max_depth=3, min_samples_leaf=5)
#validation: more accurate way to validate model for deployment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize DecisionTreeClassifier()
dt_simple = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
dt_simple
# Train a decision tree model
dt_simple = dt_simple.fit(X_train, y_train)
dt_simple
# Find out the performance of this model & interpret the results
#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html
print(metrics.accuracy_score(y_test, dt_simple.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, dt_simple.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, dt_simple.predict(X_test)))
This simpler decision tree has an even higher accuracy than the KNN model and the full decision tree model. I display the new decision tree model itself below.
# decision tree visual
from graphviz import Source
from sklearn import tree
Source( tree.export_graphviz(dt_simple, out_file=None, feature_names=X.columns))
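Graphviz requires a separate system install; as an alternative, scikit-learn's `plot_tree` renders the same tree directly with matplotlib. The sketch below fits a toy tree on the iris data set purely for illustration; the notebook's equivalent call would be `plot_tree(dt_simple, feature_names=X.columns)`.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Toy tree as a stand-in, built with the same constraints as dt_simple.
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(iris.data, iris.target)

# Render the fitted tree without any external graphviz dependency.
fig, ax = plt.subplots(figsize=(14, 6))
plot_tree(clf, feature_names=iris.feature_names, filled=True, ax=ax)
plt.show()
```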
Like in the regression section, I will now use the Random Forest Classifier to see how well this model performs. Since this is a classification task, the multicollinearity issue that occurred before will not have the same impact.
# declare X variables and y variable
y = df_reg['imdb_score_bin']
X = df_reg.drop(['imdb_score_bin','imdb_score'], axis =1)
print(y.shape, X.shape)
# split validation
#validation: more accurate way to validate model for deployment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=20) #building 20 decision trees
clf=clf.fit(X_train, y_train)
clf.score(X_test, y_test)
# generate evaluation metrics
print(metrics.accuracy_score(y_test, clf.predict(X_test))) #overall accuracy
print(metrics.confusion_matrix(y_test, clf.predict(X_test)))
print(metrics.classification_report(y_test, clf.predict(X_test)))
As I expected, the random forest classifier shows improved accuracy over the previous decision trees. Since its accuracy isn't dramatically higher, I can be more confident the improvement wasn't due to multicollinearity problems.
I then decide to deploy this model to predict the IMDB score bins of the data set.
score=df_reg.drop(['imdb_score_bin','imdb_score'], axis =1)
score.head(2)
# deploy your model for real world
predictedY = clf.predict(score)
print(predictedY)
#combine the predicted Y value with the scoring dataset
predictedY = pd.DataFrame(predictedY, columns=['predicted Y'])
predictedY.head()
#join the predicted and initial data set
data1 = score.join(predictedY)
data1.head()
#number of movies in each IMDB bin
data1.groupby('predicted Y').size()
#percent of movies in each bin
data1.groupby('predicted Y').size()/len(data1)
#visualization of number of movies in each bin (classification)
data1.groupby('predicted Y').size().plot(kind='bar');
From the above predictions, it is evident they reflect the real scores from the data set. There are a large number of movies classified as Good, followed by movies classified as Okay. A relatively small group is classified as Excellent, meaning that distinction is quite selective.
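To verify that claim more directly, a crosstab of actual bins against predicted bins shows exactly where the classifier agrees with the real scores. The sketch below simulates a mostly-correct classifier on synthetic labels; the notebook's equivalent would be `pd.crosstab(df_reg['imdb_score_bin'], data1['predicted Y'])`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real bins (labels '0'-'3' as in imdb_score_bin).
rng = np.random.default_rng(0)
actual = rng.choice(['0', '1', '2', '3'], size=200, p=[0.05, 0.35, 0.5, 0.1])

# Simulate a mostly-correct classifier by re-drawing ~20% of the labels.
predicted = actual.copy()
flip = rng.random(200) < 0.2
predicted[flip] = rng.choice(['0', '1', '2', '3'], size=flip.sum())

# Rows are the true bins, columns the predicted bins; a heavy diagonal means
# the predictions track the real scores.
table = pd.crosstab(pd.Series(actual, name='actual'),
                    pd.Series(predicted, name='predicted'))
print(table)
```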
In this section, I will look at clustering models to determine which variables matter for IMDB score and to identify clusters that describe the movies in the data set. To prepare the data for clustering models, I run a variance test, then normalize the data before fitting the models.
# variance test
df_cl = df_reg[['color','num_critic_for_reviews','duration','director_facebook_likes','num_voted_users','cast_total_facebook_likes','num_user_for_reviews','language','country','content_rating','title_year','movie_facebook_likes','roi']]
df_cl.var()
#normalize data
normalized_df=(df_cl-df_cl.mean())/df_cl.std()
normalized_df.head()
The elbow method uses KMeans to help determine how many clusters or profiles should be used when creating the models. Based on the plot below, it appears that either 4 or 5 clusters would be best.
#The Elbow method
#https://www.packtpub.com/big-data-and-business-intelligence/mastering-machine-learning-scikit-learn#
#http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
#Computes distance between each pair of the two collections of inputs
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
K = range(1, 15)
meandistortions = []
for k in K:
kmeans = KMeans(n_clusters=k, random_state=1)
kmeans.fit(normalized_df)
meandistortions.append(sum(np.min(cdist(normalized_df, kmeans.cluster_centers_, 'euclidean'), axis=1)) / normalized_df.shape[0])
plt.plot(K, meandistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
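The silhouette score is a useful complement to the elbow method: it peaks when clusters are tight and well separated, rather than relying on spotting a visual kink. The sketch below runs it on synthetic blobs with 4 true clusters as a stand-in for `normalized_df`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated clusters, standing in for normalized_df.
X_demo, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.0, random_state=1)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=1, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)
    print(k, scores[k])
```

On the real normalized movie data, a comparison of the silhouette at k=4 versus k=5 would give a second opinion on the elbow plot's suggestion.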
First I make a model with 4 clusters.
# clustering analysis using k-means
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances
k_means = KMeans(init='k-means++', n_clusters=4, random_state=0)
#fit data into model
k_means.fit(normalized_df)
# find out cluster centers
k_means.cluster_centers_
# convert cluster labels to dataframe
c = pd.DataFrame(k_means.labels_, columns = ['cluster'])
c.head()
# reset index for cluster set
c = c.reset_index(drop=True)
#verify lengths of both data set are equal before joining
len(normalized_df)
#verify lengths of both data set are equal before joining
len(c)
#join normalized data and cluster data
df2 = normalized_df.join(c)
df2.head()
#join original data and cluster data
data_cl = df_cl.join(c)
data_cl.head()
#group new data set by cluster and get average value for each variable
data_cl.groupby(['cluster']).mean()
#check how many observations in each cluster
df2.groupby('cluster').size()
The 4 profiles appear to be a good fit for the data set. When I tried 5 profiles, only one movie was assigned to the 5th cluster, so I determined that 4 was better for this model.
A description of the Movie Profiles
All Profiles:
Profile 1:
Profile 2:
Profile 3:
Profile 4:
I then do hierarchical clustering to show the 4 profiles.
#hierarchical clustering
#agglomerative clustering
np.random.seed(1) # setting random seed to get the same results each time.
agg= AgglomerativeClustering(n_clusters=4, linkage='ward').fit(normalized_df)
agg.labels_
#hierarchical cluster dendrogram
from scipy.cluster.hierarchy import dendrogram, ward
plt.figure(figsize=(16,8))
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')
linkage_matrix = ward(normalized_df)
dendrogram(linkage_matrix,
truncate_mode='lastp', # show only the last p merged clusters
p=4, # show only the last p merged clusters
#show_leaf_counts=False, # otherwise numbers in brackets are counts
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True, # to get a distribution impression in truncated branches
orientation="top")
plt.tight_layout() # fixes margins
This shows how and which clusters would combine if we continued reducing the number of clusters.
Key Findings:
Regression: Multicollinearity was quite evident in this data set when conducting regression models. Though I tried to adjust for it, the r-squared remained relatively close across the models; the only regression model with a notably higher r-squared was the Random Forest Regressor. Since that model is built on an algorithm that uses decision trees for its regression, it is less affected by multicollinearity and more likely to reach higher accuracy. That said, since the other models sat around a 0.3 r-squared and the Random Forest Regressor reached 0.91, I am not sure what this black-box model did so differently from regression approaches such as lasso and RFE. One likely contributor is that the forest was fit and evaluated on the same data, and tree ensembles can nearly memorize a training set, inflating in-sample r-squared. For this data set, then, one can either accept that the Random Forest Regressor is surprisingly accurate, or treat the gap as reason to consider it less trustworthy for this specific data set.
Classification: For the classification models, I used a full decision tree, a simplified decision tree, a KNN model, and a Random Forest Classifier. The model with the highest accuracy in predicting the correct IMDB score classification was the Random Forest Classifier, with an accuracy of 0.74, making it a fairly good model. These accuracies were much higher than those of the regressions, so I think a classification model is better suited than regression to predicting IMDB score level. The next best performing model was the simplified decision tree, with an accuracy of 0.69. Since this model and the Random Forest Classifier scored much closer together than the other regression models did to the Random Forest Regressor, it is safer to assume the Random Forest Classifier is genuinely the best model.
Clustering: From the clustering analyses, I determined that 4 profiles were best suited to this data set. While the elbow method showed 5 as potentially better, that variation put only one movie in the 5th cluster, which did little to improve the model incrementally, so I reduced it back to 4 profiles. A basic overview of the profiles: Profile 1 was generally average in duration and age, yet had some of the lowest numbers of critic and user reviews and the second-highest ROI. Profile 2 contained typically shorter films with fairly good numbers of votes and reviews that on average broke even on ROI. Profile 3 garnered the most consumer and public attention, with the highest numbers of votes, ratings, and likes and the longest durations; however, these movies on average lost money on production, with a negative ROI. Profile 4 comprised likely smaller box-office films with the lowest numbers of likes, reviews, and votes, meaning they had less of the public's attention; these movies, however, managed the highest ROIs of the 4 profiles.
Best classification model: As explained above, the Random Forest Classifier would be the most accurate model to use from the classification section. If the person deploying these models prefers to avoid random forests because their black-box method doesn't clearly show how variables are chosen and ranked, it would be advisable to select the next best model, the simplified decision tree.
Most important variables: Based on both the regression and classification models, some of the most important variables in determining high IMDB scores and movie success were: duration, title year, number of voted users, gross, budget, and user reviews. These variables were ever-present in the most successful models, such as the simple decision tree and the Random Forest Regressor and Classifier models.
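As a cross-check on these rankings, permutation importance is a model-agnostic alternative to the forests' impurity-based importances, which can be biased toward high-cardinality numeric features. It measures how much the held-out score drops when each feature is shuffled. The sketch below uses synthetic regression data rather than the movie data; the notebook's equivalent would pass the fitted `regr` with a held-out split of the movie features.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 6 features, only 3 of which actually drive the target.
X_demo, y_demo = make_regression(n_samples=500, n_features=6, n_informative=3,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Permutation importance on held-out data: the drop in score when each
# feature's values are shuffled, averaged over n_repeats shuffles.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```

If the permutation ranking broadly agrees with the impurity-based one, that lends more confidence that variables like duration and number of voted users genuinely matter rather than being artifacts of the method.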
Recommendations for movie producers: I would recommend a strong marketing campaign, as the movies with the most reviews, votes, and likes tended to have some of the best ratings. I also note that movies that were slightly longer in duration and had higher budgets usually had some of the higher IMDB scores, and newer movies tend to perform well on IMDB score as well.
Additional variables needed: Some additional variables that could be beneficial to examine include time of year of release, especially relative to peak times or holiday seasons, to see if there is a seasonality trend that predicts the highest scores. Another variable might be whether, or how many, Oscar nominations a movie received. Regarding publicity, it may be interesting to look at how much was spent on marketing campaigns to see if that is a good predictor of IMDB score, since some of the higher-scored movies had a lot of awareness.
All codes referenced from another source have the source directly listed in its specific cell.